Abstract

Background: Tree reconciliation problems have long been studied in phylogenetics. A particular variant of the 
reconciliation problem for a gene tree T and a species tree S assumes that for each interior vertex x of T it is 
known whether x represents a speciation or a duplication. This problem appears in the context of analyzing 
orthology data. 

Results: We show that S is a species tree for T if and only if S displays all rooted triples of T that have three 
distinct species as their leaves and are rooted in a speciation vertex. A valid reconciliation map can then be found 
in polynomial time. Simulated data shows that the event-labeled gene trees convey a large amount of information 
on underlying species trees, even for a large percentage of losses. 

Conclusions: The knowledge of event labels in a gene tree strongly constrains the possible species tree and, for a 
given species tree, also the possible reconciliation maps. Nevertheless, many degrees of freedom remain in the 
space of feasible solutions. In order to disambiguate the alternative solutions additional external constraints as well 
as optimization criteria could be employed. 



Background 

The reconstruction of the evolutionary history of a gene 
family is necessarily based on at least three interrelated 
types of information. The true phylogeny of the investi- 
gated species is required as a scaffold with which the 
associated gene tree must be reconcilable. Orthology or 
paralogy of genes found in different species determines 
whether an internal vertex in the gene tree corresponds 
to a duplication or a speciation event. Speciation events, 
in turn, are reflected in the species tree. 

The reconciliation of gene and species trees is a widely 
studied problem [1-10]. In most practical applications, 
however, neither the gene tree nor the species tree can 
be determined unambiguously. 

Although orthology information is often derived from 
the reconciliation of a gene tree with a species tree (cf. e.g. 
TreeFam [11], PhyOP [12], PHOG [13], EnsemblCompara 
GeneTrees [14], and MetaPhOrs [15]), recent benchmark 
studies [16] have shown that orthology can also be 
inferred at similar levels of accuracy without the need to 
construct trees by means of clustering-based approaches 
such as OrthoMCL [17], the algorithms underlying the 
COG database [18,19], InParanoid [20], or ProteinOrtho 
[21]. In [22] we have therefore addressed the question: 
how much information about the gene tree, the species 
tree, and their reconciliation is already contained in the 
orthology relation between genes? 

According to Fitch's definition [23], two genes are (co-) 
orthologous if their last common ancestor in the gene 
tree represents a speciation event. Otherwise, i.e., when 
their last common ancestor is a duplication event, they 
are paralogs. The orthology relation on a set of genes is 
therefore determined by the gene tree T and an "event 
labeling" that identifies each interior vertex of T as either 
a duplication or a speciation event. (We disregard here 
additional types of events such as horizontal transfer and 
refer to [22] for details on how such extensions might be 
incorporated into the mathematical framework.) One of 
the main results of [22], which relies on the theory of 
symbolic ultrametrics developed in [24], is the following: 
a relation on a set of genes is an orthology relation (i.e., it 
derives from some event-labeled gene tree) if and only if 
it is a cograph (for several equivalent characterizations of 
cographs see [25]). Note that the cograph does not con- 
tain the full information on the event-labeled gene tree. 
Instead the cograph is equivalent to the gene tree's 
homomorphic image obtained by collapsing adjacent 
events of the same type [22]. The orthology relation thus 
places strong and easily interpretable constraints on the 
gene tree. 

This observation suggests that a viable approach to 
reconstructing histories of large gene families may start 
from an empirically determined orthology relation, which 
can be directly adjusted to conform to the requirement 
of being a cograph. The result is then equivalent to an 
(usually incompletely resolved) event-labeled gene tree, 
which might be refined or used as constraint in the infer- 
ence of a fully resolved gene tree. In this contribution we 
are concerned with the next conceptual step: the deriva- 
tion of a species tree from an event-labeled gene tree. As 
we shall see below, this problem is much simpler than 
the full tree reconciliation problem. Technically, we will 
approach this problem by reducing the reconciliation 
map from gene tree to species tree to rooted triples of 
genes residing in three distinct species. This is related to 
an approach that was developed in [26] for addressing 
the full tree reconciliation problem. 

Methods 

Definitions and notation 

Phylogenetic trees 

A phylogenetic tree T (on L) is a rooted tree T = (V, E), with leaf set L ⊆ V, set of
directed edges E, and set of interior vertices V⁰ = V \ L that does not contain any
vertices with in- and outdegree one and whose root ρ_T ∈ V has indegree zero. In order
to avoid uninteresting trivial cases, we assume that |L| ≥ 3. The ancestor relation ⪯_T
on V is the partial order defined, for all x, y ∈ V, by x ⪯_T y whenever y lies on the
path from x to the root. If there is no danger of ambiguity, we will write x ⪯ y rather
than x ⪯_T y. Furthermore, we write x ≺_T y to mean x ⪯_T y and x ≠ y. For x ∈ V, we
write L(x) := {y ∈ L | y ⪯ x} for the set of leaves in the subtree T(x) of T rooted in x.
Thus, L(ρ_T) = L and T(ρ_T) = T. For x, y ∈ V such that x and y are joined by an edge
e ∈ E we write e = [y, x] if x ≺ y. Two phylogenetic trees T = (V, E) and T' = (V', E')
on L are said to be equivalent if there exists a bijection from V to V' that is the
identity on L, maps ρ_T to ρ_T', and extends to a graph isomorphism between T and T'.
A refinement of a phylogenetic tree T on L is a phylogenetic tree T' on L such that T
can be obtained from T' by collapsing edges (see e.g. [27]).

Suppose for the remainder of this section that T = (V, E) is a phylogenetic tree on L
with root ρ_T. For a non-empty subset of leaves A ⊆ L, we define lca_T(A), or the most
recent common ancestor of A, to be the unique vertex in T that is the least upper bound
of A under the partial order ⪯_T. In case A = {x, y}, we put lca_T(x, y) := lca_T({x, y})
and if A = {x, y, z}, we put lca_T(x, y, z) := lca_T({x, y, z}). For later reference, we
have, for all x ∈ V, that x = lca_T(L(x)). Let L' ⊆ L be a subset of |L'| ≥ 2 leaves of
T. We denote by T(L') = T(lca_T(L')) the (rooted) subtree of T with root lca_T(L'). Note
that T(L') may have leaves that are not contained in L'. The restriction T|_L' of T to
L' is the phylogenetic tree with leaf set L' obtained from T by first forming the
minimal spanning tree in T with leaf set L' and then by suppressing all vertices of
degree two with the exception of ρ_T if ρ_T is a vertex of that tree. A phylogenetic
tree T' on some subset L' ⊆ L is said to be displayed by T (or equivalently that T
displays T') if T' is equivalent with the tree T|_L'. A set 𝒯 of phylogenetic trees T,
each with leaf set L_T, is called consistent if 𝒯 = ∅ or there is a phylogenetic tree
T* on L = ∪_{T ∈ 𝒯} L_T that displays 𝒯, that is, T* displays every tree contained in
𝒯. Note that a consistent set of phylogenetic trees is sometimes also called
compatible (see e.g. [27]).

It will be convenient for our discussion below to extend the ancestor relation ⪯_T on
V to the union of the edge and vertex sets of T. More precisely, for the directed edge
e = [u, v] ∈ E we put x ≺_T e if and only if x ⪯_T v, and e ≺_T x if and only if
u ⪯_T x. For edges e = [u, v] and f = [a, b] in T we put e ⪯_T f if and only if
v ⪯_T b.

Rooted triples 

Rooted triples are phylogenetic trees on three leaves with precisely two interior
vertices. Sometimes also called rooted triplets [28], they constitute an important
concept in the context of supertree reconstruction [27,29] and will also play a major
role here. Suppose L = {x, y, z}. Then we denote by ((x, y), z) the triple r with leaf
set L for which the path from x to y does not intersect the path from z to the root
ρ_r, and which thus has lca_r(x, y) ≺_r lca_r(x, y, z). For T a phylogenetic tree, we
denote by ℜ(T) the set of all triples that are displayed by T.
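The definition of ℜ(T) can be made concrete with a small sketch. The encoding used here is an assumption for illustration only and is not part of the original text: a tree is a nested tuple whose leaves are strings, and a triple ((x, y), z) is stored as the pair (frozenset({x, y}), z). A triple is displayed exactly when x and y lie below one child of lca(x, y, z) and z lies below a different child.

```python
from itertools import combinations


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def displayed_triples(tree):
    """Return all rooted triples ((x, y), z) displayed by `tree`.

    Each triple is encoded as (frozenset({x, y}), z); this encoding is a
    hypothetical convenience, not notation from the paper.
    """
    triples = set()
    if isinstance(tree, str):
        return triples
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        # Triples displayed entirely inside one child subtree.
        triples |= displayed_triples(child)
        # Triples rooted here: x, y below child i, z below another child.
        outside = set().union(*(cl for j, cl in enumerate(child_leaves) if j != i))
        for x, y in combinations(sorted(child_leaves[i]), 2):
            for z in outside:
                triples.add((frozenset({x, y}), z))
    return triples


example = ((("a", "b"), "c"), "d")
print(sorted((sorted(pair), z) for pair, z in displayed_triples(example)))
```

Note that an unresolved vertex such as ("a", "b", "c") contributes no triple, since no two of its leaves share a child.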

Clearly, a set R of triples is consistent if there is a phylogenetic tree T on
X = ∪_{r ∈ R} L(ρ_r) such that R ⊆ ℜ(T). Not all sets of triples are consistent, of
course. Given a triple set R there is a polynomial-time algorithm, referred to in [27]
as BUILD, that either constructs a phylogenetic tree T that displays R or recognizes
that R is inconsistent [30]. Various practical implementations have been described
starting with [30]; improved variants are discussed in [31,32].
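For orientation, the following is a minimal sketch of the classical recursive formulation of BUILD as it is commonly presented; it is not the implementation used in this work, and the function name `build` and the tuple encoding (x, y, z) for the triple ((x, y), z) are assumptions made for illustration. At each step the leaves are partitioned into the connected components of the graph that joins x and y for every triple ((x, y), z) restricted to the current leaf set; a single component on more than one leaf signals inconsistency.

```python
def build(leaf_set, triples):
    """Return a nested-tuple tree displaying all `triples`, or None if inconsistent.

    `triples` is an iterable of (x, y, z), read as the rooted triple ((x, y), z).
    Leaves are assumed to be strings and `leaf_set` is assumed to be non-empty.
    """
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))

    # Keep only triples whose three leaves all lie in the current leaf set.
    local = [(x, y, z) for (x, y, z) in triples if {x, y, z} <= leaf_set]

    # Union-find over the leaves: join x and y for every local triple ((x, y), z).
    parent = {v: v for v in leaf_set}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for x, y, _ in local:
        parent[find(x)] = find(y)

    components = {}
    for v in leaf_set:
        components.setdefault(find(v), set()).add(v)

    if len(components) == 1:
        return None  # the triples cannot be displayed by any tree

    children = []
    for comp in sorted(components.values(), key=sorted):
        subtree = build(comp, local)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


print(build({"a", "b", "c"}, [("a", "b", "c")]))
print(build({"a", "b", "c"}, [("a", "b", "c"), ("a", "c", "b")]))
```

Sorting the components only makes the output deterministic; it has no effect on which trees are recognized as consistent.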

The problem of determining a maximum consistent 
subset of an inconsistent set of triples, on the other 
hand, is NP-hard and also APX-hard, see [33,34] and the 
references therein. We refer to [35] for an overview on 
the available practical approaches and further theoretical 
results. 

The BUILD algorithm, furthermore, does not necessarily generate for a given triple set
R a minimal phylogenetic tree T that displays R, i.e., T may resolve multifurcations in
an arbitrary way that is not implied by any of the triples in R. However, the tree
generated by BUILD is minor-minimal, i.e., if T' is obtained from T by contracting an
edge, T' does not display R anymore. The trees produced by BUILD do not necessarily
have the minimum number of internal vertices. Thus, depending on R, not all trees
consistent with R can be obtained from BUILD. Semple [36] gives an algorithm that
produces all minor-minimal trees consistent with R. It requires only polynomial time
for each of the possibly exponentially many minor-minimal trees. The problem of
constructing a tree consistent with R and minimizing the number of interior vertices is
NP-hard and hard to approximate [37].

Event labeling, species labeling, and reconciliation map 

A gene tree T arises through a series of events along a 
species tree S. We consider both T and S as phyloge- 
netic trees with leaf sets L (the set of genes) and B (the 
set of species), respectively. We assume that |L| ≥ 3 and 
|B| ≥ 1. We consider only gene duplications and gene 
losses, which take place between speciation events, i.e., 
along the edges of S. Speciation events are modeled by 
transmitting the gene content of an ancestral lineage to 
each of its daughter lineages. 

The true evolutionary history of a single ancestral gene thus can be thought of as a
scenario such as the one depicted in Figure 1. Since we do not consider horizontal gene
transfer or lineage sorting in this contribution, an evolutionary scenario consists of
four components: (1) a true gene tree T̂, (2) a true species tree Ŝ, (3) an assignment
of an event type (i.e., speciation •, duplication □, loss ⊗, or observable (extant)
gene ⊙) to each interior vertex and leaf of T̂, and (4) a map μ̂ assigning every vertex
of T̂ to a vertex or edge of Ŝ in such a way that (a) the ancestor order of T̂ is
preserved, (b) a vertex of T̂ is mapped to an interior vertex of Ŝ if and only if it is
of type speciation, and (c) extant genes of T̂ are mapped to leaves of Ŝ. Alternatively,
one could define T̂ and Ŝ to be metric graphs (i.e., comprising edges that are real
intervals glued together at the vertices) with a distance function that measures
evolutionary time. In this picture, μ̂ is a continuous map that preserves the temporal
order and satisfies conditions (b) and (c).

In order to allow μ̂ to map duplication vertices to a time point before the last common
ancestor of all species in Ŝ, we need to extend our definition of a species tree by
adding an extra vertex and an extra edge "above" the last common ancestor of all
species. Note that strictly speaking Ŝ is not a phylogenetic tree anymore. In case
there is no danger of confusion, we will from now on refer to a phylogenetic tree on B
with this extra edge and vertex added as a species tree on B and to ρ_B as the root of
B. Also, we canonically extend our notions of a triple, displaying, etc. to this new
type of species tree.

The true gene tree T̂ represents all extant as well as all extinct genes, all
duplication events, and all speciation events. Not all of these events are observable
from extant gene data, however. In particular, extinct genes cannot be observed. The
observable part T = (V, E) of T̂ is the restriction of T̂ to the leaf set L of extant
genes, i.e., T = T̂|_L.

Furthermore, we can observe a map σ : L → B that assigns to each extant gene the
species in which it resides. Of course, for x ∈ L we have σ(x) = μ̂(x). Here B is the
leaf set of the extant species tree, i.e., B = σ(L). For ease of readability, we also
put σ(T') = {σ(x) : x ∈ L(y)} for any subtree T' of T with T' = T(y) where y ∈ V⁰.
Alternatively, we will sometimes also write σ(y) instead of σ(T(y)). Last but not
least, for Y ⊆ L, we put σ(Y) = {σ(y) : y ∈ Y}.

The observable part of the species tree S = (W, H) is the restriction Ŝ|_B of Ŝ to B.
In order to account for duplication events that occurred before the first speciation
event, the additional vertex ρ_S ∈ W and the additional edge [ρ_S, lca_S(B)] ∈ H must
be part of S.

The evolutionary scenario also implies an event labeling map t : V → {•, □, ⊙} that
assigns to each interior vertex v of T a value t(v) indicating whether v is a
speciation event (•) or a duplication event (□). It is convenient to use the special
label ⊙ for the leaves x of T. We write (T, t) for the event-labeled tree. We remark
that t was introduced as "symbolic dating map" in [24]. It is called discriminating if,
for all edges {u, v} ∈ E, we have t(u) ≠ t(v), in which case (T, t) is known to be in
1-1-correspondence to a cograph [22]. Note that we will in general not require that t
is discriminating in this contribution. For T = (V, E) a gene tree on L, B a set of
species, and maps t and σ as specified above, we require however that t and σ satisfy
the following compatibility property:

(C) Let z ∈ V be a speciation vertex, i.e., t(z) = •, and let T' and T'' be subtrees of
T rooted in two distinct children of z. Then σ(T') ∩ σ(T'') = ∅.

Note that we do not require the converse, i.e., from the disjointness of the species
sets σ(T') and σ(T'') we do not conclude that their last common ancestor is a
speciation vertex.

For x, y ∈ L and z = lca_T(x, y) it immediately follows from condition (C) that if
t(lca_T(x, y)) = • then σ(x) ≠ σ(y) since, by assumption, x and y are leaves in
distinct subtrees below z. Equivalently, two distinct genes x ≠ y in L for which
σ(x) = σ(y) holds, that is, they are contained in the same species of B, must have
originated from a duplication event, i.e., t(lca_T(x, y)) = □. Thus we can regard σ as
a proper vertex coloring of the cograph corresponding to (T, t).

Figure 1 Gene trees. Left: Example of an evolutionary scenario showing the evolution of a gene family. The corresponding true gene tree T̂
appears embedded in the true species tree Ŝ. The map μ̂ is implicitly given by drawing the species tree superimposed on the gene tree. In
particular, the speciation vertices in the gene tree (red circles) are mapped to the vertices of the species tree (gray ovals) and the duplication
vertices (blue squares) to the edges of the species tree. Gene losses are represented with "⊗" (mapping to edges in Ŝ). The observable species
a, b, ..., f are the leaves of the species tree (green ovals) and extant genes therein are labeled with "⊙". Right: The corresponding gene tree T with
observed events from the left tree. Leaves are labeled with the corresponding species.
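Condition (C) is straightforward to test on a given event-labeled gene tree. The sketch below assumes a hypothetical encoding that is not taken from the original text: an interior vertex is a pair (event, children) with event "spec" or "dup", a leaf is a gene name, and the species assignment is a dictionary `sigma` from gene names to species names.

```python
def species_below(node, sigma):
    """Return the set of species of all genes in the subtree rooted at `node`."""
    if isinstance(node, str):
        return {sigma[node]}
    _, children = node
    return set().union(*(species_below(child, sigma) for child in children))


def satisfies_condition_c(node, sigma):
    """Check that the child subtrees of every speciation vertex have
    pairwise disjoint species sets."""
    if isinstance(node, str):
        return True
    event, children = node
    if event == "spec":
        seen = set()
        for child in children:
            species = species_below(child, sigma)
            if seen & species:
                return False  # a species occurs below two distinct children
            seen |= species
    return all(satisfies_condition_c(child, sigma) for child in children)


sigma = {"a1": "A", "a2": "A", "b1": "B"}
good = ("spec", [("dup", ["a1", "a2"]), "b1"])
bad = ("spec", ["a1", "a2"])
print(satisfies_condition_c(good, sigma), satisfies_condition_c(bad, sigma))
```

The second tree fails because two genes of the same species are separated by a vertex labeled as a speciation, which is exactly the situation excluded by the remark on paralogs above.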

Let us now consider the properties of the restriction of μ̂ to the observable parts T
of T̂ and S of Ŝ. Consider a speciation vertex x in T̂. If x has two children y' and y''
so that L(y') and L(y'') are both non-empty then x = lca_T̂(z', z'') for all z' ∈ L(y')
and z'' ∈ L(y'') and hence, x = lca_T̂(L(y') ∪ L(y'')). In particular, x is an
observable vertex in T. Furthermore, we know that σ(L(y')) ∩ σ(L(y'')) = ∅, and
therefore, μ̂(x) = lca_S(σ(L(y') ∪ L(y''))). Considering all pairs of children with this
property, this can be rephrased as μ̂(x) = lca_S(σ(L(x))). On the other hand, if x does
not have at least two children with this property, and hence the corresponding
speciation vertex cannot be viewed as the most recent common ancestor of the set of its
descendants in S, then x is not a vertex in the restriction T = T̂|_L of T̂ to the set L
of the extant genes. The restriction μ of μ̂ to the observable tree T therefore
satisfies the properties used below to define reconciliation maps.

Definition 1. Suppose that B is a set of species, that S = (W, H) is a phylogenetic
tree on B, that T = (V, E) is a gene tree with leaf set L, and that σ : L → B and
t : V → {•, □, ⊙} are the maps described above. Then we say that S is a species tree
for (T, t, σ) if there is a map μ : V → W ∪ H such that, for all x ∈ V:

(i) If t(x) = ⊙ then μ(x) = σ(x).

(ii) If t(x) = • then μ(x) ∈ W \ B.

(iii) If t(x) = □ then μ(x) ∈ H.

(iv) Let x, y ∈ V with x ≺_T y. We distinguish two cases:

1. If t(x) = t(y) = □ then μ(x) ⪯_S μ(y) in S.

2. If t(x) = t(y) = • or t(x) ≠ t(y) then μ(x) ≺_S μ(y) in S.

(v) If t(x) = • then μ(x) = lca_S(σ(L(x))).

We call μ the reconciliation map from (T, t, σ) to S.

We note that μ⁻¹(ρ_S) = ∅ holds as an immediate consequence of property (v), which
implies that no speciation node can be mapped above lca_S(B), the unique child of ρ_S.

Figure 2 Mapping μ. Example of the mapping μ of nodes of the gene tree T (left panel) to the species tree S (right panel). Speciation nodes in
the gene tree (red circles) are mapped to nodes in the species tree, duplication nodes (blue squares) are mapped to edges in the species tree.
σ is shown as dashed green arrows. For clarity of exposition, we have identified the leaves of the gene tree on the left with the species they
reside in via the map σ.

We illustrate this definition by means of an example in Figure 2 and remark that it is
consistent with the definition of reconciliation maps for the case when the event
labeling t on T is not known [38]. Continuing with our notation from Definition 1 for
the remainder of this section, we easily derive their axiom set as



Lemma 2. If μ is a reconciliation map from (T, t, σ) to S and L is the leaf set of T
then, for all x ∈ V:

(D1) x ∈ L implies μ(x) = σ(x).

(D2.a) μ(x) ∈ W implies μ(x) = lca_S(σ(L(x))).

(D2.b) μ(x) ∈ H implies lca_S(σ(L(x))) ≺_S μ(x).

(D3) Suppose x, y ∈ V such that x ≺_T y. If μ(x), μ(y) ∈ H then μ(x) ⪯_S μ(y);
otherwise μ(x) ≺_S μ(y).

Proof. Suppose x ∈ V. Then (D1) is equivalent to (i) and the fact that t(x) = ⊙ if and
only if x ∈ L. Conditions (ii) and (v) together imply (D2.a). If μ(x) ∈ H then x is a
duplication vertex of T. From condition (iv) we conclude that lca_S(σ(L(x))) ⪯_S μ(x).
Since lca_S(σ(L(x))) ∈ W, equality cannot hold and so (D2.b) follows. (D3) is an
immediate consequence of (iv). □

For T a gene tree, B a set of species and maps σ and t as above, our goal is now to
characterize (1) those (T, t, σ) for which a species tree on B exists and (2) species
trees on B that are species trees for (T, t, σ).



Results and discussion 

Results 

Unless stated otherwise, we continue with our assumptions on B, (T, t, σ), and S as
stated in Definition 1. We start with the simple observation that a reconciliation map
from (T, t, σ) to S preserves the ancestor order of T and hence T imposes a strong
constraint on the relationship of most recent common ancestors in S:

Lemma 3. Let μ : V → W ∪ H be a reconciliation map from (T, t, σ) to S. Then

$$\operatorname{lca}_S(\mu(x), \mu(y)) \preceq_S \mu(\operatorname{lca}_T(x, y)) \qquad (1)$$

holds for all x, y ∈ V.

Proof. Assume that x and y are distinct vertices of T. Consider the unique path P
connecting x with y. P is uniquely subdivided into a path P' from x to lca_T(x, y) and
a path P'' from lca_T(x, y) to y. Condition (iv) implies that the images of the
vertices of P' and P'' under μ, respectively, are ordered in S with regard to ⪯_S and
hence are contained in the intervals Q' and Q'' that connect μ(lca_T(x, y)) with μ(x)
and μ(y), respectively. In particular, μ(lca_T(x, y)) is the largest element (w.r.t.
⪯_S) in the union Q' ∪ Q'', which contains the unique path from μ(x) to μ(y) and hence
also lca_S(μ(x), μ(y)). □

Equation (1) is well known to hold for gene tree/spe- 
cies reconciliation in the absence of a prescribed event 
labeling in T. 

Since a phylogenetic tree (in the original sense) T is uniquely determined by its
induced triple set ℜ(T), it is reasonable to expect that all the information on the
species tree(s) for (T, t, σ) is contained in the images of the triples in ℜ(T) (or
more precisely their leaves) under σ. However, this is not the case in general as the
situation is complicated by the fact that not all triples in ℜ(T) are informative
about a species tree that displays T. The reason is that duplications may generate
distinct paralogs long before the divergence of the species in which they eventually
appear. To address this problem, we associate to (T, t, σ) the set of triples

$$\mathfrak{G} = \mathfrak{G}(T, t, \sigma) := \{\, r \in \mathfrak{R}(T) \mid t(\operatorname{lca}_T(r)) = \bullet \text{ and } \sigma(x) \neq \sigma(y) \text{ for all } x, y \in L(r) \text{ pairwise distinct} \,\}. \qquad (2)$$

As we shall see below, 𝔊(T, t, σ) contains all the information on a species tree for
(T, t, σ) that can be gleaned from (T, t, σ).

Lemma 4. If μ is a reconciliation map from (T, t, σ) to S and ((x, y), z) ∈ 𝔊(T, t, σ)
then S displays ((σ(x), σ(y)), σ(z)).

Proof. Put 𝔊 = 𝔊(T, t, σ) and recall that L denotes the leaf set of T. Let {x, y, z}
be a three-element subset of L and assume w.l.o.g. that ((x, y), z) ∈ 𝔊. First
consider the case that t(lca_T(x, y)) = •. From condition (v) we conclude that
μ(lca_T(x, y)) = lca_S(σ(x), σ(y)) and μ(lca_T(x, y, z)) = lca_S(σ(x), σ(y), σ(z)).
Since, by assumption, lca_T(x, y) ≺_T lca_T(x, y, z), we have as a consequence of
condition (iv) that μ(lca_T(x, y)) ≺_S μ(lca_T(x, y, z)). From lca_T(x, z) =
lca_T(y, z) = lca_T(x, y, z) we conclude that S must display ((σ(x), σ(y)), σ(z)) as S
is assumed to be a species tree for (T, t, σ).

Now suppose that t(lca_T(x, y)) = □ and therefore, μ(lca_T(x, y)) ∈ H. Moreover,
μ(lca_T(x, y, z)) ∈ W holds. Hence, Lemma 3 and property (iv) together imply that
lca_S(σ(x), σ(y)) ⪯_S μ(lca_T(x, y)) ≺_S μ(lca_T(x, y, z)). Thus, we again obtain that
the triple ((σ(x), σ(y)), σ(z)) is displayed by S. □

It is important to note that a similar argument cannot be made for triples in ℜ(T)
rooted in a duplication vertex of T as such triples are in general not displayed by a
species tree for (T, t, σ). We present the generic counterexample in Figure 3. To state
our main result (Theorem 6), we require a further definition.



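Definition 5. For (T, t, σ), we define the set

$$\mathfrak{S} = \mathfrak{S}(T, t, \sigma) := \{\, ((a, b), c) \mid \exists\, ((x, y), z) \in \mathfrak{G}(T, t, \sigma) \text{ with } \sigma(x) = a,\ \sigma(y) = b, \text{ and } \sigma(z) = c \,\}. \qquad (3)$$

As an immediate consequence of Lemma 4, 𝔖(T, t, σ) must be displayed by any species
tree for (T, t, σ) with leaf set B.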
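The species triples of Equations (2) and (3) can be collected in a single traversal of the gene tree. The following sketch uses a hypothetical encoding chosen for illustration (interior vertices as (event, children) pairs with event "spec" or "dup", leaves as gene names, `sigma` as a dictionary from genes to species) and returns each species triple ((a, b), c) as (frozenset({a, b}), c). Only triples whose root lca_T(x, y, z) is a speciation vertex and whose three genes come from three distinct species are kept.

```python
from itertools import combinations


def gene_leaves(node):
    """Return the list of genes in the subtree rooted at `node`."""
    if isinstance(node, str):
        return [node]
    return [gene for child in node[1] for gene in gene_leaves(child)]


def species_triples(node, sigma):
    """Return the species triples induced by gene triples rooted in a
    speciation vertex with three pairwise distinct species."""
    result = set()
    if isinstance(node, str):
        return result
    event, children = node
    child_leaves = [gene_leaves(child) for child in children]
    for i, child in enumerate(children):
        result |= species_triples(child, sigma)
        if event != "spec":
            continue  # triples rooted in a duplication vertex are ignored
        outside = [g for j, cl in enumerate(child_leaves) if j != i for g in cl]
        for x, y in combinations(child_leaves[i], 2):
            for z in outside:
                a, b, c = sigma[x], sigma[y], sigma[z]
                if len({a, b, c}) == 3:
                    result.add((frozenset({a, b}), c))
    return result


sigma = {"a1": "A", "b1": "B", "c1": "C"}
rooted_in_speciation = ("spec", [("dup", ["a1", "b1"]), "c1"])
rooted_in_duplication = ("dup", [("spec", ["a1", "b1"]), "c1"])
print(species_triples(rooted_in_speciation, sigma))
print(species_triples(rooted_in_duplication, sigma))
```

The first example yields the triple ((A, B), C) although lca_T(a1, b1) is a duplication, since only the event at the root of the gene triple matters; the second yields no triple at all.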

Theorem 6. Let S be a species tree with leaf set B. Then there exists a reconciliation
map μ from (T, t, σ) to S whenever S displays all triples in 𝔖(T, t, σ).

Proof. Recall that L is the leaf set of T = (V, E). Put S = (W, H) and
𝔖 = 𝔖(T, t, σ). We first consider the subset G := {x ∈ V | t(x) ∈ {•, ⊙}} of V
comprising the leaves and speciation vertices of T.

We explicitly construct the map μ : G → W as follows. For all x ∈ G, we put

(M1) μ(x) = σ(x) if t(x) = ⊙,

(M2) μ(x) = lca_S(σ(L(x))) if t(x) = •.

Note that alternative (M1) ensures that μ satisfies Condition (i). Also note that in
view of the simple consequence following the statement of Condition (C) we have for all
x ∈ V with t(x) = • that there are leaves y', y'' ∈ L(x) with σ(y') ≠ σ(y''). Thus
lca_S(σ(L(x))) ∈ W \ B, i.e., μ satisfies Condition (ii). Also note that, by
definition, alternative (M2) ensures that μ satisfies Condition (v).

Claim: If x, y ∈ G with x ≺_T y then μ(x) ≺_S μ(y).

Since y cannot be a leaf of T as x ≺_T y, we have t(y) = •. There are two cases to
consider, either t(x) = • or t(x) = ⊙. In the latter case μ(x) = σ(x) ∈ B while
μ(y) ∈ W \ B as argued above. Since x ∈ L(y) we have μ(x) ≺_S μ(y), as desired.

Now suppose t(x) = •. Again by the simple consequence following Condition (C), there
are leaves x', x'' ∈ L(x) with a = σ(x') ≠ σ(x'') = b. Since x ≺_T y and t(y) = •, by
Condition (C), we conclude that c = σ(y') ∉ σ(L(x)) holds for all y' ∈ L(y) \ L(x).
Thus, ((a, b), c) ∈ 𝔖. But then ((a, b), c) is displayed by S and therefore
lca_S(a, b) ≺_S lca_S(a, b, c). Since this holds for all triples ((x', x''), y') ∈ 𝔊
with x', x'' ∈ L(x) and y' ∈ L(y) \ L(x) we conclude

$$\mu(x) = \operatorname{lca}_S(\sigma(L(x))) \prec_S \operatorname{lca}_S\bigl(\sigma(L(x)) \cup \sigma(L(y) \setminus L(x))\bigr) = \operatorname{lca}_S(\sigma(L(y))) = \mu(y),$$

establishing the claim. It follows immediately that μ also satisfies Condition (iv.2)
if x and y are contained in G.

Next, we extend the map μ to the entire vertex set V of T using the following
observation. Let x ∈ V with t(x) = □. We know by Lemma 3 that μ(x) is an edge
[u, v] ∈ H so that lca_S(σ(L(x))) ⪯_S v. Such an edge exists for v = lca_S(σ(L(x))) by
construction. Every speciation vertex y ∈ V with x ≺_T y therefore necessarily maps
above this edge, i.e., u ⪯_S μ(y) must hold. Thus we set

(M3) μ(x) = [u, lca_S(σ(L(x)))] if t(x) = □,

which now makes μ a map from V to W ∪ H.

By construction, Conditions (iii), (iv.2) and (v) are thus satisfied by μ. On the other
hand, if there is a speciation vertex y between two duplication vertices x and x' of T,
i.e., x ≺_T y ≺_T x', then μ(x) ≺_S μ(x'). Thus μ also satisfies Condition (iv.1).

It follows that μ is a reconciliation map from (T, t, σ) to S. □

Figure 3 Triples with duplication event at the root. Triples from T whose root is a duplication event are in general not displayed by the
species tree S. (a) Triple with duplication event at the root obtained from the true evolutionary history of T shown in panel (b). Panel (c) is the
true species tree. In the triple (a) the species y appears as the outgroup even though x is the outgroup in the true species tree.

Corollary 7. Suppose that S is a species tree for (T, t, σ) and that L and B are the
leaf sets of T and S, respectively. Then a reconciliation map μ from (T, t, σ) to S can
be constructed in O(|L||B|).

Proof. In order to find the image of an interior vertex x of T under μ, it suffices to
determine σ(L(x)) (which can be done for all x simultaneously, e.g. by bottom-up
traversal of T in O(|L||B|) time) and lca_S(σ(L(x))). The latter task can be solved in
linear time using the idea presented in [39] to calculate the lowest common ancestor
for a group of nodes in the species tree. □
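The construction (M1) to (M3) from the proof of Theorem 6, together with the bottom-up computation of σ(L(x)) mentioned in the proof of Corollary 7, can be sketched as follows. All data structures are assumptions made for illustration: the gene tree is a dictionary from vertex names to (event, children) with events "leaf", "spec" and "dup", the species tree is a child-to-parent dictionary that includes the extra root (here called "rho_S"), and edges of S are written as (parent, child) pairs. The lowest common ancestor is computed naively rather than with the linear-time method of [39], and the sketch does not check that S actually displays the required triples.

```python
def ancestors(v, parent):
    """Return the path from v up to the root of the species tree."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path


def lca(species, parent):
    """Naive lowest common ancestor of a non-empty collection of species."""
    species = list(species)
    common = ancestors(species[0], parent)
    for s in species[1:]:
        above = set(ancestors(s, parent))
        common = [v for v in common if v in above]
    return common[0]


def reconcile(gene_tree, root, sigma, parent):
    """Map leaves by (M1), speciations by (M2) and duplications by (M3)."""
    mu = {}

    def visit(x):
        event, children = gene_tree[x]
        if event == "leaf":
            mu[x] = sigma[x]                      # (M1)
            return {sigma[x]}
        species = set()
        for child in children:                    # bottom-up species sets
            species |= visit(child)
        v = lca(species, parent)
        # (M2): speciation -> vertex v; (M3): duplication -> edge above v.
        mu[x] = v if event == "spec" else (parent[v], v)
        return species

    visit(root)
    return mu


# Hypothetical example: species tree ((A, B), C) with extra root "rho_S".
parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": "rho_S"}
gene_tree = {
    "r": ("dup", ["u", "w"]),
    "u": ("spec", ["a1", "b1"]),
    "w": ("spec", ["a2", "c1"]),
    "a1": ("leaf", []), "b1": ("leaf", []),
    "a2": ("leaf", []), "c1": ("leaf", []),
}
sigma = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
mu = reconcile(gene_tree, "r", sigma, parent)
print(mu["u"], mu["w"], mu["r"])
```

In this example the duplication at the root of the gene tree is placed on the extra edge above lca_S(B), which is the situation the additional vertex ρ_S was introduced for.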

We remark that given a species tree S on B that displays all triples in 𝔖(T, t, σ),
there is no freedom in the construction of a reconciliation map on the set
{x ∈ V | t(x) ∈ {•, ⊙}}. The duplication vertices of T, however, can be placed
differently, resulting in possibly exponentially many reconciliation maps from
(T, t, σ) to S.

Lemma 4 implies that consistency of the triple set 𝔖(T, t, σ) is necessary for the
existence of a reconciliation map from (T, t, σ) to a species tree on B. Theorem 6, on
the other hand, establishes that this is also sufficient. Thus, we have

Theorem 8. There is a species tree on B for (T, t, σ) if and only if the triple set
𝔖(T, t, σ) is consistent.

We remark that a related result is proven in [26, Theorem 5] for the full tree
reconciliation problem starting from a forest of gene trees.

It may be surprising that there are no strong restrictions on the set 𝔖(T, t, σ) of
triples that are implied by the fact that they are derived from a gene tree (T, t, σ).



Theorem 9. For every set 𝕏 of triples on some finite set B of size at least one there
is a gene tree T = (V, E) with leaf set L together with an event map
t : V → {•, □, ⊙} and a map σ : L → B that assigns to every leaf of T the species in B
it resides in, such that 𝕏 = 𝔖(T, t, σ).

Proof. Irrespective of whether 𝕏 is consistent or not we construct the components of
the required 3-tuple (T, t, σ) as follows: To each triple r_k = ((x_k1, x_k2), x_k3) ∈ 𝕏
we associate a triple T_k = ((a_k1, a_k2), a_k3) via a map
σ_k : L_k = {a_k1, a_k2, a_k3} → {x_k1, x_k2, x_k3} with σ_k(a_ki) = x_ki for
i = 1, 2, 3, where we assume that for any two distinct triples r_k, r_l ∈ 𝕏 the leaf
sets are disjoint, i.e., L_k ∩ L_l = ∅. Then we obtain T = (V, E) by first adding a
single new vertex ρ_T to the union of the vertex sets of the triples T_k and then
connecting ρ_T to the root ρ_k of each of the triples T_k. Clearly, T is a phylogenetic
tree on L = ∪_{r_k ∈ 𝕏} L(ρ_k). Next, we define the map t : V → {•, □, ⊙} by putting
t(ρ_T) = □, t(a) = ⊙ for all a ∈ L and t(a) = • for all a ∈ V − (L ∪ {ρ_T}). Finally,
we define the map σ : L → B by putting, for all a ∈ L, σ(a) = σ_k(a) where a ∈ L_k.
Clearly 𝔖(T, t, σ) = 𝕏. □
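The construction in the proof of Theorem 9 is easy to reproduce. The sketch below is illustrative only and relies on assumed conventions: species triples are given as tuples (x, y, z) standing for ((x, y), z), fresh gene names of the form "g0_1" are invented for the leaves, interior vertices are (event, children) pairs, and a compact helper recomputes the species triples rooted in speciation vertices so that the result can be compared with the input.

```python
from itertools import combinations


def gene_tree_from_triples(triples):
    """Attach one fresh gene triple per species triple below a duplication root."""
    sigma, subtrees = {}, []
    for k, (x, y, z) in enumerate(triples):
        a1, a2, a3 = f"g{k}_1", f"g{k}_2", f"g{k}_3"   # hypothetical gene names
        sigma.update({a1: x, a2: y, a3: z})
        subtrees.append(("spec", [("spec", [a1, a2]), a3]))
    return ("dup", subtrees), sigma


def gene_leaves(node):
    """Return the list of genes in the subtree rooted at `node`."""
    if isinstance(node, str):
        return [node]
    return [gene for child in node[1] for gene in gene_leaves(child)]


def species_triples(node, sigma):
    """Species triples of gene triples rooted in a speciation vertex."""
    result = set()
    if isinstance(node, str):
        return result
    event, children = node
    child_leaves = [gene_leaves(child) for child in children]
    for i, child in enumerate(children):
        result |= species_triples(child, sigma)
        if event != "spec":
            continue
        outside = [g for j, cl in enumerate(child_leaves) if j != i for g in cl]
        for x, y in combinations(child_leaves[i], 2):
            for z in outside:
                a, b, c = sigma[x], sigma[y], sigma[z]
                if len({a, b, c}) == 3:
                    result.add((frozenset({a, b}), c))
    return result


# An inconsistent input: ((A, B), C) and ((A, C), B) cannot share a species tree.
X = [("A", "B", "C"), ("A", "C", "B")]
tree, sigma = gene_tree_from_triples(X)
recovered = species_triples(tree, sigma)
print(recovered == {(frozenset({x, y}), z) for x, y, z in X})
```

Because the root is labeled as a duplication, no triple spanning two of the attached subtrees is collected, so the recovered set coincides with the input even when the input is inconsistent.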

We remark that the gene tree constructed in the proof of Theorem 9 can be made into a
binary tree by splitting the root ρ_T into a series of duplication and loss events so
that each subtree is the descendant of a different paralog. Since by Theorem 9 there
are no restrictions on the possible triple sets 𝔖(T, t, σ), it is clear that S will in
general not be unique. An example is shown in Figure 4.

Figure 4 Inferred species trees. The set 𝔖(T, t, σ) inferred from the event-labeled gene tree (T, t, σ) does not necessarily define a unique
species tree. For clarity of exposition, we have identified, via the map σ, the leaves of the gene tree and of the set of triples 𝔖(T, t, σ) with
the species they reside in.

Results for simulated gene trees

In order to determine empirically how much information on the species tree we can hope
to find in event-labeled gene trees, we simulated species trees together with
corresponding event-labeled gene trees with different duplication and loss rates.
Approximately 150 species trees with 10 to 100 species were generated according to the
"age model" [40]. These trees are balanced and the edge lengths are normalized so that
the total length of the path from the root to each leaf is 1. For each species tree, we
then simulated a gene tree as described in [41], with duplication and loss rate
parameters r ∈ [0, 1] sampled uniformly. Events are modeled by a Poisson distribution
with parameter r · ℓ, where ℓ is the length of an edge as generated by the age model.
Losses were additionally constrained to retain at least one copy in each species, i.e.,
σ(L) = B is enforced. After determining the triple set 𝔖(T, t, σ) according to
Theorem 6, we used BUILD [27] (see also [42]) to compute the species tree. In all cases
BUILD returns a tree that is a homomorphic contraction of the simulated species tree.
The difference between the original and the reconstructed species tree is thus
conveniently quantified as the difference in the number of interior vertices. Note that
in our situation this is the same as the split metric [27].

The results are summarized in Figure 5. Not surpris- 
ingly, the recoverable information decreases in particular 
with the rate of gene loss. Nevertheless, at least 50% of the 
splits in the species tree are recoverable even at very high
loss rates. For moderate loss rates, in particular when gene
losses are less frequent than gene duplications, nearly the 
complete information on the species tree is preserved. It is 
interesting to note that BUILD does not incorporate splits 
that are not present in the input tree, although this is not 
mathematically guaranteed. 
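
As an illustration of how the fraction of recovered splits might be computed, the following Python sketch compares the clusters (leaf sets of interior vertices) of a reference tree with those of an inferred tree. Representing trees as nested tuples and identifying splits with clusters of the rooted tree are assumptions of this sketch, not a description of the evaluation code used for Figure 5.

```python
def clusters(tree):
    """Return the leaf sets (as frozensets) of all interior vertices of `tree`."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])  # a leaf is any non-tuple label
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found

def recovered_fraction(true_tree, inferred_tree):
    """Fraction of clusters of the true tree that also occur in the inferred tree."""
    true_c, inferred_c = clusters(true_tree), clusters(inferred_tree)
    return len(true_c & inferred_c) / len(true_c)

# Hypothetical example: the inferred tree contracts the edge above {A, B}.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred = (("A", "B", "C"), ("D", "E"))
print(recovered_fraction(true_tree, inferred))  # 0.75
```

When the inferred tree is a contraction of the true tree, as observed in the simulations, the number of missing clusters equals the difference in the number of interior vertices (here 4 - 3 = 1).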

Discussion 

Event-labeled gene trees can be obtained by combining 
the reconstruction of gene phylogenies with methods for 
orthology detection. Orthology alone already encapsu- 
lates partial information on the gene tree. More pre- 
cisely, the orthology relation is equivalent to a 
homomorphic image of the gene tree in which adjacent 
vertices denote different types of events. We discussed
here the properties of reconciliation maps μ from a gene
tree T along with an event labeling map t and a gene-to-species
assignment map σ to a species tree S. We show
that event-labeled gene trees (T, t) for which a species
tree exists can be characterized in terms of the set 𝔖 of
triples that is easily constructed from a subset of triples
of T. Simulated data shows, furthermore, that such trees
convey a large amount of information on the underlying 
species tree, even if the gene loss rate is high. 
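
The construction of the triple set can be sketched in a few lines of Python. Following the characterization stated in the abstract, the sketch collects every triple ab|c of the gene tree whose root lca(a, b, c) is a speciation vertex and whose leaves lie in three distinct species, and maps it to the species level via σ. The nested (event, children) representation, the labels "spec" and "dup", and the function names are assumptions made for illustration.

```python
from itertools import combinations

def leaves(node):
    """List the gene names below `node`; a leaf is a plain string."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]

def species_triples(node, sigma, out=None):
    """Collect species triples (frozenset({A, B}), C), read as AB|C."""
    out = set() if out is None else out
    if isinstance(node, str):
        return out
    event, children = node
    if event == "spec":
        for i, inner in enumerate(children):
            for j, outer in enumerate(children):
                if i == j:
                    continue
                # a, b share a child subtree, c lies in another: ab|c rooted here
                for a, b in combinations(leaves(inner), 2):
                    for c in leaves(outer):
                        sa, sb, sc = sigma[a], sigma[b], sigma[c]
                        if len({sa, sb, sc}) == 3:
                            out.add((frozenset((sa, sb)), sc))
    for child in children:
        species_triples(child, sigma, out)
    return out

# Hypothetical gene tree: a duplication below a speciation root.
gene_tree = ("spec", [("dup", [("spec", ["a1", "b1"]), ("spec", ["a2", "c2"])]), "d1"])
sigma = {"a1": "A", "b1": "B", "a2": "A", "c2": "C", "d1": "D"}
triples = species_triples(gene_tree, sigma)
# triples == {AB|D, AC|D, BC|D}
```

In this toy example the duplication vertex contributes no triples, so the relative order of A, B, and C remains unconstrained.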

Figure 5 Recovered splits in species trees. Left: Heat map that represents the percentage of recovered splits in the inferred species tree from
triples obtained from simulated event-labeled gene trees with different loss and duplication rates. Right: Scattergram that shows the average of
losses and duplications in the generated data and the accuracy of the inferred species tree.

It can be expected that for real-life data the tree T
contains errors so that 𝔖 := 𝔖(T, t, σ) may not be
consistent. In this case, an approximation to the species
tree could be obtained e.g. from a maximum consistent
subset of 𝔖. Although (the decision version of) this problem
is NP-complete [43,44], there is a wide variety of
practically applicable algorithms for this task, see
[35,45]. Even if 𝔖 is consistent, the species tree is
usually not uniquely determined. Algorithms to list all
trees consistent with 𝔖 can be found e.g. in [46,47]. A
characterization of triple sets that determine a unique
tree can be found in [48]. Since our main interest is to
determine the constraints imposed by (T, t, σ) on the
species tree S, we are interested in a least resolved tree
S that displays all triples in 𝔖. The BUILD algorithm
and its relatives in general produce minor-minimal 
trees, but these are not guaranteed to have the minimal 
number of interior nodes. Finding a species tree with a
minimal number of interior nodes is again a hard problem
[37]. At least, the vertex-minimal trees are among
the possibly exponentially many minor-minimal trees
enumerated by Semple's algorithms [36].
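
For concreteness, the following Python sketch shows the standard recursive scheme behind BUILD: connect a and b for every triple ab|c on the current leaf set, recurse on the connected components of the resulting graph, and report inconsistency if the graph is connected. The encoding of triples as (frozenset({a, b}), c), the nested-tuple output, and the union-find helper are assumptions of this sketch; it is not the implementation used for the simulations.

```python
def build(triples, leaves):
    """Return a nested-tuple tree displaying all triples, or None if inconsistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only triples whose three leaves are all in the current leaf set.
    local = [(pair, c) for pair, c in triples if pair <= leaves and c in leaves]
    # Union-find over the auxiliary graph: join a and b for every triple ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, _ in local:
        a, b = tuple(pair)
        parent[find(a)] = find(b)
    components = {}
    for x in leaves:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None  # connected graph on more than one leaf: no tree exists
    subtrees = []
    for block in sorted(components.values(), key=sorted):
        subtree = build(local, block)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)

# Hypothetical triple set AB|D, AC|D, BC|D on the species A, B, C, D.
triples = {(frozenset("AB"), "D"), (frozenset("AC"), "D"), (frozenset("BC"), "D")}
print(build(triples, "ABCD"))  # (('A', 'B', 'C'), 'D')
```

The returned tree leaves A, B, and C unresolved. Any refinement of that vertex also displays the three triples, which illustrates why the species tree is usually not uniquely determined.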

For a given species tree S, it is rather easy to find a
reconciliation map μ from (T, t, σ) to S. A simple solution
μ is closely related to the so-called LCA reconciliation:
every node x of T is mapped to the last common
ancestor of the species below it, lca_S(σ(L(x))), or to the
edge immediately above it, depending on whether x is
a speciation or a duplication node. While this solution is
unique for the speciation nodes, alternative mappings 
are possible for the duplication nodes. The set of
possible reconciliation maps can still be very large
despite the specified event labels. If the event labeling t 
is unknown, there is a reconciliation from any gene tree 
T to any species tree S, realized in particular by the 
LCA reconciliation, see e.g. [26,38]. The reconciliation 
then defines the event types. Typically, a parsimony rule 
is then employed to choose a reconciliation map in 
which the number of duplications and losses is mini- 
mized, see e.g. [1,4,5,9]. In our setting, on the other 
hand, the event types are prescribed. This restricts the 
possible reconciliation maps so that the gene tree can- 
not be reconciled with an arbitrary species tree any 
more. Since the observable events on the gene tree are 
fixed, the possible reconciliations cannot differ in the 
number of duplications. Still, one may be interested in 
reconciliation maps that minimize the number of loss 
events. An alternative is to maximize the number of 
duplication events that map to the same edge in S to 
account for whole genome and chromosomal duplica- 
tion events [9]. 
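
The simple LCA-based map described at the beginning of this paragraph can be sketched as follows. Speciation nodes are sent to the vertex lca_S(σ(L(x))) and duplication nodes to the edge immediately above that vertex. The child-to-parent dictionary for S, the (event, children) encoding of T, the addressing of interior gene tree nodes by their path of child indices, and the use of None for the upper end of the edge above the root are all assumptions of this sketch. It does not check that S is actually a species tree for (T, t, σ).

```python
def lca(parent, nodes):
    """Last common ancestor of `nodes` in a tree given as a child -> parent dict."""
    def path_to_root(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path

    nodes = list(nodes)
    common = path_to_root(nodes[0])
    for v in nodes[1:]:
        on_path = set(path_to_root(v))
        common = [u for u in common if u in on_path]
    return common[0]  # lowest vertex shared by all root paths

def lca_reconciliation(gene_tree, sigma, parent):
    """Map leaves and speciations to vertices of S, duplications to edges of S."""
    mu = {}

    def visit(node, name):
        if isinstance(node, str):
            mu[node] = ("vertex", sigma[node])
            return {sigma[node]}
        event, children = node
        species = set()
        for i, child in enumerate(children):
            species |= visit(child, name + (i,))
        v = lca(parent, species)
        if event == "spec":
            mu[name] = ("vertex", v)
        else:
            mu[name] = ("edge", (parent.get(v), v))  # edge just above v
        return species

    visit(gene_tree, ())
    return mu

# Hypothetical inputs: species tree (((A,B)v,C)u,D)r and a small gene tree.
parent = {"A": "v", "B": "v", "v": "u", "C": "u", "u": "r", "D": "r"}
gene_tree = ("spec", [("dup", [("spec", ["a1", "b1"]), ("spec", ["a2", "c2"])]), "d1"])
sigma = {"a1": "A", "b1": "B", "a2": "A", "c2": "C", "d1": "D"}
mu = lca_reconciliation(gene_tree, sigma, parent)
# mu[(0,)] == ("edge", ("r", "u")): the duplication sits on the edge above u
```

Alternative maps would place the duplication node on a higher edge, which is the remaining freedom discussed above.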

Conclusions 

Our approach to the reconciliation problem via event- 
labeled gene trees opens up some interesting new ave- 
nues to understanding orthology. In particular, the 
results in this contribution combined with those in [22] 
concerning cographs should ultimately lead to a method 
for automatically generating orthology relations that 
takes into account species relationships without having 
to explicitly compute gene trees. This is potentially very
useful since gene tree estimation is one of the weak 
points of most current approaches to orthology analysis. 